Abstract

Lifelong reinforcement learning provides a promising framework for
developing versatile agents that can accumulate knowledge over a lifetime
of experience and rapidly learn new tasks by building upon prior
knowledge. However, current lifelong learning methods exhibit
non-vanishing regret as the amount of experience increases, and include
limitations that can lead to suboptimal or unsafe control policies. To
address these issues, we develop a lifelong policy gradient learner that
operates in an adversarial setting to learn multiple tasks online while
enforcing safety constraints on the learned policies. We demonstrate, for
the first time, sublinear regret for lifelong policy search, and validate
our algorithm on several benchmark dynamical systems and an application
to quadrotor control.

1. Introduction 

Reinforcement learning (RL) (Busoniu et al., 2010; Sutton & Barto, 1998)
often requires substantial experience before achieving acceptable
performance on individual control problems. One major contributor to this
issue is the tabula-rasa assumption of typical RL methods, which learn
from scratch on each new task. In these settings, learning performance is
directly correlated with the quality of the acquired samples.
Unfortunately, the amount of experience necessary for high-quality
performance increases exponentially with the tasks’ degrees of freedom,
inhibiting the application of RL to high-dimensional control problems.

When data is in limited supply, transfer learning can significantly
improve model performance on new tasks by reusing previously learned
knowledge during training (Taylor & Stone, 2009; Gheshlaghi Azar et al.,
2013; Lazaric, 2011; Ferrante et al., 2008; Bou Ammar et al., 2012).
Multi-task learning (MTL) explores another notion of knowledge transfer,
in which task models are trained simultaneously and share knowledge
during the joint learning process (Wilson et al., 2007; Zhang et al.,
2008).

In the lifelong learning setting (Thrun & O’Sullivan, 1996a;b), which can
be framed as an online MTL problem, agents acquire knowledge
incrementally by learning multiple tasks consecutively over their
lifetime. Recently, based on the work of Ruvolo & Eaton (2013) on
supervised lifelong learning, Bou Ammar et al. (2014) developed a
lifelong learner for policy gradient RL. To ensure efficient learning
over consecutive tasks, these works employ a second-order Taylor
expansion around the parameters that are (locally) optimal for each task
without transfer. This assumption simplifies the MTL objective into a
weighted quadratic form for online learning, but since it is based on
single-task learning, this technique can lead to parameters far from
globally optimal. Consequently, the success of these methods for RL
highly depends on the policy initializations, which must lead to
near-optimal trajectories for meaningful updates. Also, since their
objective functions average loss over all tasks, these methods exhibit
non-vanishing regrets of the form $\mathcal{O}(R)$, where $R$ is the
total number of rounds in a non-adversarial setting.

In addition, these methods may produce control policies with unsafe
behavior (i.e., capable of causing damage to the agent or environment,
catastrophic failure, etc.). This is a critical issue in robotic control,
where unsafe control policies can lead to physical damage or user injury.
This problem is caused by using constraint-free optimization over the
shared knowledge during the transfer process, which may lead to
uninformative or unbounded policies.

In this paper, we address these issues by proposing the first safe
lifelong learner for policy gradient RL operating in an adversarial
framework. Our approach rapidly learns high-performance safe control
policies based on the agent’s previously learned knowledge and safety
constraints on each task, accumulating knowledge over multiple
consecutive tasks to optimize overall performance. We theoretically
analyze the regret exhibited by our algorithm, showing sublinear
dependency of the form $\mathcal{O}(\sqrt{R})$ for $R$ rounds, thus
outperforming current methods. We then evaluate our approach empirically
on a set of dynamical systems.

2. Background

2.1. Reinforcement Learning 

An RL agent sequentially chooses actions to minimize its expected cost.
Such problems are formalized as Markov decision processes (MDPs)
$\langle \mathcal{X}, \mathcal{U}, \mathcal{P}, c, \gamma \rangle$, where
$\mathcal{X} \subset \mathbb{R}^{d}$ is the (potentially infinite) state
space, $\mathcal{U} \in \mathbb{R}^{d_a}$ is the set of all possible
actions, $\mathcal{P} : \mathcal{X} \times \mathcal{U} \times \mathcal{X}
\to [0,1]$ is a state transition probability describing the system’s
dynamics, $c : \mathcal{X} \times \mathcal{U} \times \mathcal{X} \to
\mathbb{R}$ is the cost function measuring the agent’s performance, and
$\gamma \in [0,1]$ is a discount factor. At each time step $m$, the agent
is in state $x_m \in \mathcal{X}$ and must choose an action
$u_m \in \mathcal{U}$, transitioning it to a new state
$x_{m+1} \sim \mathcal{P}(x_{m+1} \mid x_m, u_m)$ and yielding a cost
$c_{m+1} = c(x_{m+1}, u_m, x_m)$. The sequence of state-action pairs
forms a trajectory $\tau = [x_{0:M-1}, u_{0:M-1}]$ over a (possibly
infinite) horizon $M$. A policy $\pi : \mathcal{X} \times \mathcal{U} \to
[0,1]$ specifies a probability distribution over state-action pairs,
where $\pi(u \mid x)$ represents the probability of selecting an action
$u$ in state $x$. The goal of RL is to find an optimal policy
$\pi^{\star}$ that minimizes the total expected cost.

Policy search methods have shown success in solving high-dimensional
problems, such as robotic control (Kober & Peters, 2011; Peters & Schaal,
2008a; Sutton et al., 2000). These methods represent the policy
$\pi_{\alpha}(u \mid x)$ using a vector $\alpha \in \mathbb{R}^{d}$ of
control parameters. The optimal policy $\pi^{\star}$ is found by
determining the parameters $\alpha^{\star}$ that minimize the expected
average cost:

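$$ l(\alpha) = \sum_{k=1}^{n} p_{\alpha}\left(\tau^{(k)}\right) C\left(\tau^{(k)}\right) , \tag{1} $$

where $n$ is the total number of trajectories, and
$p_{\alpha}(\tau^{(k)})$ and $C(\tau^{(k)})$ are the probability and cost
of trajectory $\tau^{(k)}$:

$$ p_{\alpha}\left(\tau^{(k)}\right) = \mathcal{P}_0\left(x_0^{(k)}\right) \prod_{m=0}^{M-1} \mathcal{P}\left(x_{m+1}^{(k)} \mid x_m^{(k)}, u_m^{(k)}\right) \pi_{\alpha}\left(u_m^{(k)} \mid x_m^{(k)}\right) $$

$$ C\left(\tau^{(k)}\right) = \frac{1}{M} \sum_{m=0}^{M-1} c\left(x_{m+1}^{(k)}, u_m^{(k)}, x_m^{(k)}\right) $$

with an initial state distribution $\mathcal{P}_0 : \mathcal{X} \to
[0,1]$. We handle a constrained version of policy search, in which
optimality not only corresponds to minimizing the total expected cost,
but also to ensuring that the policy satisfies safety constraints. These
constraints vary between applications, for example corresponding to
maximum joint torque or prohibited physical positions.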

2.2. Online Learning & Regret Analysis 

In this paper, we employ a special form of regret minimization games,
which we briefly review here. A regret minimization game is a triple
$\langle \mathcal{K}, \mathcal{F}, R \rangle$, where $\mathcal{K}$ is a
non-empty decision set, $\mathcal{F}$ is the set of moves of the
adversary which contains bounded convex functions from $\mathbb{R}^{n}$
to $\mathbb{R}$, and $R$ is the total number of rounds. The game proceeds
in rounds, where at each round $j = 1, \dots, R$, the agent chooses a
prediction $\theta_j \in \mathcal{K}$, and the environment (i.e., the
adversary) chooses a loss function $l_j \in \mathcal{F}$. At the end of
the round, the loss function $l_j$ is revealed to the agent and the
decision $\theta_j$ is revealed to the environment. In this paper, we
handle the full-information case, where the agent may observe the entire
loss function $l_j$ as its feedback and can exploit this in making
decisions. The goal is to minimize the cumulative regret
$\sum_{j=1}^{R} l_j(\theta_j) - \inf_{u \in \mathcal{K}} \sum_{j=1}^{R} l_j(u)$.

When analyzing the regret of our methods, we use a variant of this
definition to handle the lifelong RL case:

$$ \mathfrak{R}_R = \sum_{j=1}^{R} l_{t_j}(\theta_j) - \inf_{u \in \mathcal{K}} \sum_{j=1}^{R} l_{t_j}(u) , $$

where $l_{t_j}(\cdot)$ denotes the loss of task $t$ at round $j$.
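
As an illustration of the regret definition above, the following minimal sketch computes cumulative regret on a toy one-dimensional problem. The quadratic per-round loss, the interval decision set, and the function name `cumulative_regret` are assumptions made for the example; they are not the policy-gradient losses used in this paper.

```python
def cumulative_regret(decisions, targets, lo, hi):
    """Regret of `decisions` against the best fixed point in [lo, hi].

    Toy assumption: round j uses the loss l_j(theta) = (theta - targets[j])**2.
    """
    if len(decisions) != len(targets):
        raise ValueError("one decision per round is required")

    def total_loss(theta):
        return sum((theta - a) ** 2 for a in targets)

    # The summed loss is a convex parabola centred on the mean target, so its
    # minimiser over the interval [lo, hi] is that mean clipped to the interval.
    mean = sum(targets) / len(targets)
    best_fixed = min(max(mean, lo), hi)

    # Loss actually incurred by the learner's round-by-round decisions.
    incurred = sum((d - a) ** 2 for d, a in zip(decisions, targets))
    return incurred - total_loss(best_fixed)
```

A learner that already plays the best fixed decision in every round has zero regret under this definition, while any other sequence is charged the gap to that comparator.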


For our framework, we adopt a variant of regret minimization called
“Follow the Regularized Leader,” which minimizes regret in two steps.
First, the unconstrained solution $\tilde{\theta}$ is determined (see
Sect. 4.1) by solving an unconstrained optimization over the accumulated
losses observed so far. Given $\tilde{\theta}$, the constrained solution
is then determined by learning a projection into the constraint set via
Bregman projections (see Abbasi-Yadkori et al. (2013)).
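
The two-step procedure can be sketched on a deliberately simple case. The example below assumes linear losses, a regularizer of the form `mu * ||theta||^2`, and a box-shaped constraint set; under those assumptions the Bregman projection reduces to coordinate-wise clipping. The function name `ftrl_round` and the box constraint are illustrative choices, not the constraint sets used later in this paper.

```python
def ftrl_round(grad_sum, eta, mu, lo, hi):
    """One Follow-the-Regularized-Leader step for linear losses.

    grad_sum : per-coordinate sum of the loss gradients seen so far
    eta      : weight on the accumulated losses
    mu       : weight of the regulariser mu * ||theta||^2 (must be > 0)
    lo, hi   : box bounds standing in for the constraint set (assumption)
    """
    # Step 1: unconstrained minimiser of mu*||theta||^2 + eta*<grad_sum, theta>.
    # Setting the gradient 2*mu*theta + eta*grad_sum to zero gives:
    unconstrained = [-eta * g / (2.0 * mu) for g in grad_sum]

    # Step 2: for this regulariser the Bregman divergence equals
    # mu*||theta - unconstrained||^2, so projecting onto a box is a clip.
    constrained = [min(max(t, lo), hi) for t in unconstrained]
    return unconstrained, constrained
```

Coordinates that already lie inside the box are left unchanged by the projection; only those outside it are moved to the nearest bound.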


3. Safe Lifelong Policy Search 

We adopt a lifelong learning framework in which the agent learns multiple
RL tasks consecutively, providing it the opportunity to transfer
knowledge between tasks to improve learning. Let $\mathcal{T}$ denote the
set of tasks, each element of which is an MDP. At any time, the learner
may face any previously seen task, and so must strive to maximize its
performance across all tasks. The goal is to learn optimal policies
$\pi^{\star}_{\alpha^{\star}_{1}}, \dots, \pi^{\star}_{\alpha^{\star}_{|\mathcal{T}|}}$
for all tasks, where policy $\pi^{\star}_{\alpha^{\star}_{t}}$ for task
$t$ is parameterized by $\alpha^{\star}_{t} \in \mathbb{R}^{d}$. In
addition, each task is equipped with safety constraints to ensure
acceptable policy behavior: $A_t \alpha_t \le b_t$, with
$A_t \in \mathbb{R}^{d \times d}$ and $b_t \in \mathbb{R}^{d}$
representing the allowed policy combinations. The precise form of these
constraints depends on the application domain, but this formulation
supports constraints on (e.g.) joint torque, acceleration, position, etc.

At each round $j$, the learner observes a set of $n_{t_j}$ trajectories
$\{\tau_{t_j}^{(1)}, \dots, \tau_{t_j}^{(n_{t_j})}\}$ from a task
$t_j \in \mathcal{T}$, where each trajectory has length $M_{t_j}$. To
support knowledge transfer between tasks, we assume that each task’s
policy parameters $\alpha_{t_j} \in \mathbb{R}^{d}$ at round $j$ can be
written as a linear combination of a shared latent basis
$L \in \mathbb{R}^{d \times k}$ with coefficient vectors
$s_{t_j} \in \mathbb{R}^{k}$; therefore, $\alpha_{t_j} = L s_{t_j}$. Each
column of $L$ represents a chunk of transferrable knowledge; this task
construction has been used successfully in previous multi-task learning
work (Kumar & Daume III, 2012; Ruvolo & Eaton, 2013; Bou Ammar et al.,
2014). Extending this previous work, we ensure that the shared knowledge
repository is “informative” by incorporating bounding constraints on the
Frobenius norm $\| \cdot \|_F$ of $L$. Consequently, the optimization
problem after observing $r$ rounds is:

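$$ \min_{L, S} \; \sum_{j=1}^{r} \left[ \eta_{t_j} l_{t_j}\left(L s_{t_j}\right) \right] + \mu_1 \|S\|_F^2 + \mu_2 \|L\|_F^2 \tag{4} $$

$$ \text{s.t.} \quad A_{t_j} \alpha_{t_j} \le b_{t_j} \quad \forall t_j \in \mathcal{I}_r $$

$$ \lambda_{\min}\left(L L^{\top}\right) \ge p \quad \text{and} \quad \lambda_{\max}\left(L L^{\top}\right) \le q , $$

where $p$ and $q$ are the constraints on $\|L\|_F$, $\eta_{t_j} \in
\mathbb{R}$ are design weighting parameters¹, $\mathcal{I}_r = \{t_1,
\dots, t_r\}$ denotes the set of all tasks observed so far through round
$r$, and $S$ is the collection of all coefficients: the column of $S$
belonging to an observed task $t_j \in \mathcal{I}_r$ is $s_{t_j}$, and
the columns of tasks not yet observed are zero.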
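
To make the structure of Eq. (4) concrete, the sketch below evaluates the regularized multi-task objective and checks a linear safety constraint for policy parameters of the form `alpha = L s`. The helper names, the representation of `S` as a list of columns, and the generic per-task loss callables are assumptions for illustration; the paper's actual losses are the policy-gradient losses discussed next.

```python
def mat_vec(M, v):
    """Return M v for a matrix M (list of rows) and a vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def frobenius_sq(M):
    """Squared Frobenius norm: the sum of all squared entries."""
    return sum(x * x for row in M for x in row)


def lifelong_objective(L, S_cols, losses, etas, mu1, mu2):
    """sum_j eta_j * l_j(L s_j) + mu1 * ||S||_F^2 + mu2 * ||L||_F^2.

    S_cols holds one coefficient vector s_j per observed task, and
    losses[j] is a hypothetical callable mapping alpha_j to a scalar cost.
    """
    data_term = sum(eta * loss(mat_vec(L, s))
                    for eta, loss, s in zip(etas, losses, S_cols))
    return data_term + mu1 * frobenius_sq(S_cols) + mu2 * frobenius_sq(L)


def satisfies_constraint(A, alpha, b):
    """Check the element-wise safety constraint A alpha <= b."""
    return all(lhs <= rhs for lhs, rhs in zip(mat_vec(A, alpha), b))
```

Because every task reads its parameters through the same `L`, a change to one column of `L` moves the policies of all tasks whose coefficient vectors use that column.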


The loss function $l_{t_j}(\alpha_{t_j})$ in Eq. (4) corresponds to a
policy gradient learner for task $t_j$, as defined in Eq. (1). Typical
policy gradient methods (Kober & Peters, 2011; Sutton et al., 2000)
maximize a lower bound of the expected cost $l_{t_j}(\alpha_{t_j})$,
which can be derived by taking the logarithm and applying Jensen’s
inequality:


$\log\left[ l_{t_j}\left(\alpha_{t_j}\right) \right] = \log$

Therefore, our goal is to minimize the following objective:


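$$ e_r(L, S) = -\sum_{j=1}^{r} \left[ \frac{\eta_{t_j}}{n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \log \pi_{\alpha_{t_j}}\left(u_m^{(k, t_j)} \mid x_m^{(k, t_j)}\right) \right] + \mu_1 \|S\|_F^2 + \mu_2 \|L\|_F^2 \tag{6} $$

$$ \text{s.t.} \quad A_{t_j} \alpha_{t_j} \le b_{t_j} \quad \forall t_j \in \mathcal{I}_r $$

$$ \lambda_{\min}\left(L L^{\top}\right) \ge p \quad \text{and} \quad \lambda_{\max}\left(L L^{\top}\right) \le q . $$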


3.1. Online Formulation 

The optimization problem above can be mapped to the standard online
learning framework by unrolling $L$ and $S$ into a vector
$\theta = [\mathrm{vec}(L) \;\; \mathrm{vec}(S)]^{\top} \in
\mathbb{R}^{dk + k|\mathcal{T}|}$. Choosing
$\Omega_0(\theta) = \mu_2 \sum_{i=1}^{dk} \theta_i^2 + \mu_1
\sum_{i=dk+1}^{dk + k|\mathcal{T}|} \theta_i^2$, and
$\Omega_j(\theta) = \Omega_{j-1}(\theta) + \eta_{t_j} l_{t_j}(\theta)$,
we can write the safe lifelong policy search problem (Eq. (6)) as:

$$ \theta_{r+1} = \arg\min_{\theta \in \mathcal{K}} \Omega_r(\theta) , \tag{7} $$

where $\mathcal{K} \subseteq \mathbb{R}^{dk + k|\mathcal{T}|}$ is the set
of allowable policies under the given safety constraints.

¹We describe in Sect. 5 how to set the $\eta$’s to obtain regret bounds,
and leave them as variables now for generality.

Note that the loss for task $t_j$ can be written as a bilinear product in
$\theta$:


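$$ l_{t_j}(\theta) = -\frac{1}{n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \log \pi_{\Theta_L \Theta_{s_{t_j}}}\left(u_m^{(k, t_j)} \mid x_m^{(k, t_j)}\right) $$

$$ \Theta_L = \begin{bmatrix} \theta_1 & \cdots & \theta_{d(k-1)+1} \\ \vdots & & \vdots \\ \theta_d & \cdots & \theta_{dk} \end{bmatrix} , \qquad \Theta_{s_{t_j}} = \begin{bmatrix} \theta_{dk+1} \\ \vdots \\ \theta_{(d+1)k+1} \end{bmatrix} . $$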

We see that the problem in Eq. (7) is equivalent to Eq. (6) by noting
that at $r$ rounds, $\Omega_r = \sum_{j=1}^{r} \eta_{t_j}
l_{t_j}(\theta) + \Omega_0(\theta)$.

4. Online Learning Method 

We solve Eq. (7) in two steps. First, we determine the unconstrained
solution $\tilde{\theta}_{r+1}$ when $\mathcal{K} = \mathbb{R}^{dk +
k|\mathcal{T}|}$ (see Sect. 4.1). Given $\tilde{\theta}_{r+1}$, we derive
the constrained solution $\hat{\theta}_{r+1}$ by learning a projection
$\mathrm{Proj}_{\Omega_r, \mathcal{K}}(\tilde{\theta}_{r+1})$ to the
constraint set $\mathcal{K} \subseteq \mathbb{R}^{dk + k|\mathcal{T}|}$,
which amounts to minimizing the Bregman divergence over
$\Omega_r(\theta)$ (see Sect. 4.2)². The complete approach is given in
Algorithm 1 and is available as a software implementation on the authors’
websites.

4.1. Unconstrained Policy Solution 

Although Eq. (6) is not jointly convex in both $L$ and $S$, it is
separably convex (for log-concave policy distributions). Consequently, we
follow an alternating optimization approach, first computing $L$ while
holding $S$ fixed, and then updating $S$ given the acquired $L$. We
detail this process for two popular PG learners, eREINFORCE (Williams,
1992) and eNAC (Peters & Schaal, 2008b). The derivations of the update
rules below can be found in Appendix A.
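
The idea of alternating between two blocks of a separably convex objective can be seen on a scalar toy problem. The sketch below minimizes `(y - l*s)**2 + mu1*s**2 + mu2*l**2`, which is not jointly convex in `(l, s)` but is a convex quadratic in each variable when the other is held fixed. It uses exact per-block minimization rather than the gradient steps of eREINFORCE or eNAC, and all names are hypothetical.

```python
def objective(l, s, y, mu1, mu2):
    """Scalar stand-in for a bilinear fit with two ridge penalties."""
    return (y - l * s) ** 2 + mu1 * s ** 2 + mu2 * l ** 2


def alternate(y, l, s, mu1, mu2, sweeps):
    """Alternate exact minimisation over l and s (mu1, mu2 must be > 0)."""
    history = [objective(l, s, y, mu1, mu2)]
    for _ in range(sweeps):
        # With s fixed the objective is a convex quadratic in l:
        # -2*s*(y - l*s) + 2*mu2*l = 0  =>  l = y*s / (s*s + mu2).
        l = y * s / (s * s + mu2)
        # With the new l fixed it is a convex quadratic in s.
        s = y * l / (l * l + mu1)
        history.append(objective(l, s, y, mu1, mu2))
    return l, s, history
```

Each half-step can only lower (or keep) the objective, so the recorded values form a non-increasing sequence even though the joint problem is non-convex.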

These updates are governed by learning rates $\beta$ and $\lambda$ that
decay over time; $\beta$ and $\lambda$ can be chosen using line-search
methods as discussed by Boyd & Vandenberghe (2004). In our experiments,
we adopt a simple yet effective strategy, where $\beta$ and $\lambda$
decay according to a constant $c$ with $0 < c < 1$.

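Step 1: Updating L. Holding $S$ fixed, the latent repository can be
updated according to:

$$ L_{\beta+1} = L_{\beta} - \eta_L^{\beta} \nabla_L e_r(L, S) \qquad \text{(eREINFORCE)} $$

$$ L_{\beta+1} = L_{\beta} - \eta_L^{\beta} G^{-1}(L_{\beta}, S_{\beta}) \nabla_L e_r(L, S) \qquad \text{(eNAC)} $$

with learning rate $\eta_L^{\beta} \in \mathbb{R}$, and $G^{-1}(L, S)$ as
the inverse of the Fisher information matrix (Peters & Schaal, 2008b).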

In the special case of Gaussian policies, the update for $L$ can be
derived in a closed form as $L_{\beta+1} = Z_L^{-1} v_L$, where

$$ Z_L = 2 \mu_2 I_{dk \times dk} + \sum_{j=1}^{r} \frac{\eta_{t_j}}{n_{t_j} \sigma_{t_j}^2} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \mathrm{vec}\left(\Phi s_{t_j}^{\top}\right) \left(\Phi^{\top} \otimes s_{t_j}^{\top}\right) $$

$$ v_L = \sum_{j=1}^{r} \frac{\eta_{t_j}}{n_{t_j} \sigma_{t_j}^2} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \mathrm{vec}\left(u_m^{(k, t_j)} \Phi s_{t_j}^{\top}\right) , $$

$\sigma_{t_j}^2$ is the covariance of the Gaussian policy for a task
$t_j$, and $\Phi = \Phi\left(x_m^{(k, t_j)}\right)$ denotes the state
features.

²In Sect. 4.2, we linearize the loss around the constrained solution of
the previous round to increase stability and ensure convergence. Given
the linear losses, it suffices to solve the Bregman divergence over the
regularizer, reducing the computational cost.


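Step 2: Updating S. Given the fixed basis $L$, the coefficient matrix $S$
is updated column-wise for all $t_j \in \mathcal{I}_r$:

$$ s_{\lambda+1}^{(t_j)} = s_{\lambda}^{(t_j)} - \eta_S^{\lambda} \nabla_{s_{t_j}} e_r(L, S) \qquad \text{(eREINFORCE)} $$

$$ s_{\lambda+1}^{(t_j)} = s_{\lambda}^{(t_j)} - \eta_S^{\lambda} G^{-1}(L_{\beta}, S_{\beta}) \nabla_{s_{t_j}} e_r(L, S) \qquad \text{(eNAC)} $$

with learning rate $\eta_S^{\lambda} \in \mathbb{R}$. For Gaussian
policies, the closed-form of the update is $s_{t_j} = Z_{s_{t_j}}^{-1}
v_{s_{t_j}}$, where

$$ Z_{s_{t_j}} = 2 \mu_1 I_{k \times k} + \sum_{t_k = t_j} \frac{\eta_{t_j}}{n_{t_j} \sigma_{t_j}^2} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} L^{\top} \Phi \Phi^{\top} L $$

$$ v_{s_{t_j}} = \sum_{t_k = t_j} \frac{\eta_{t_j}}{n_{t_j} \sigma_{t_j}^2} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} u_m^{(k, t_j)} L^{\top} \Phi . $$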


4.2. Constrained Policy Solution

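Once we have obtained the unconstrained solution $\tilde{\theta}_{r+1}$
(which satisfies Eq. (7), but can lead to policy parameters in unsafe
regions), we then derive the constrained solution to ensure safe
policies. We learn a projection $\mathrm{Proj}_{\Omega_r,
\mathcal{K}}(\tilde{\theta}_{r+1})$ from $\tilde{\theta}_{r+1}$ to the
constraint set:

$$ \hat{\theta}_{r+1} = \arg\min_{\theta \in \mathcal{K}} B_{\Omega_r, \mathcal{K}}\left(\theta, \tilde{\theta}_{r+1}\right) , \tag{8} $$

where $B_{\Omega_r, \mathcal{K}}(\theta, \tilde{\theta}_{r+1})$ is the
Bregman divergence over $\Omega_r$:

$$ B_{\Omega_r, \mathcal{K}}\left(\theta, \tilde{\theta}_{r+1}\right) = \Omega_r(\theta) - \Omega_r(\tilde{\theta}_{r+1}) - \mathrm{trace}\left( \nabla_{\theta} \Omega_r(\theta)\big|_{\tilde{\theta}_{r+1}} \left(\theta - \tilde{\theta}_{r+1}\right) \right) . $$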
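
A small numerical check can help build intuition for the Bregman divergence. The sketch below evaluates the definition for a generic differentiable function given as a pair of callables, and uses a quadratic regularizer `mu * ||theta||^2` as a stand-in for the regularizer; in that special case the divergence collapses to a scaled squared Euclidean distance. Parameters are treated as flat vectors, so the trace becomes an inner product; the function names are assumptions for this example.

```python
def bregman_divergence(omega, grad_omega, theta, theta_ref):
    """omega(theta) - omega(theta_ref) - <grad omega(theta_ref), theta - theta_ref>."""
    g = grad_omega(theta_ref)
    inner = sum(g_i * (t - r) for g_i, t, r in zip(g, theta, theta_ref))
    return omega(theta) - omega(theta_ref) - inner


def make_quadratic(mu):
    """Return omega(theta) = mu * ||theta||^2 together with its gradient."""
    def omega(theta):
        return mu * sum(t * t for t in theta)

    def grad(theta):
        return [2.0 * mu * t for t in theta]

    return omega, grad
```

For the quadratic case the result equals `mu * ||theta - theta_ref||^2`, which is why projecting under this divergence behaves like an ordinary Euclidean projection.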

Solving Eq. (8) is computationally expensive since $\Omega_r(\theta)$
includes the sum back to the original round. To remedy this problem,
ensure the stability of our approach, and guarantee that the constrained
solutions for all observed tasks lie within a bounded region, we
linearize the current-round loss function $l_{t_r}(\theta)$ around the
constrained solution of the previous round $\hat{\theta}_r$.


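Given the above linear form, we can rewrite the optimization problem in
Eq. (8) as:

$$ \hat{\theta}_{r+1} = \arg\min_{\theta \in \mathcal{K}} B_{\Omega_0, \mathcal{K}}\left(\theta, \tilde{\theta}_{r+1}\right) . \tag{10} $$

Consequently, determining safe policies for lifelong policy search
reinforcement learning amounts to solving:

$$ \min_{L, S} \; \mu_1 \|S\|_F^2 + \mu_2 \|L\|_F^2 + 2 \mu_2 \, \mathrm{trace}\left( L^{\top}\big|_{\tilde{\theta}_{r+1}} L \right) + 2 \mu_1 \, \mathrm{trace}\left( S^{\top}\big|_{\tilde{\theta}_{r+1}} S \right) $$

$$ \text{s.t.} \quad A_{t_j} L s_{t_j} \le b_{t_j} \quad \forall t_j \in \mathcal{I}_r $$

$$ L L^{\top} \succeq p I \quad \text{and} \quad L L^{\top} \preceq q I . $$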


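To solve the optimization problem above, we start by converting the
inequality constraints to equality constraints by introducing slack
variables $c_{t_j} > 0$. We also guarantee that these slack variables are
bounded by incorporating $\|c_{t_j}\| \le c_{\max}$, $\forall t_j \in
\{1, \dots, |\mathcal{T}|\}$:

$$ \min_{L, S, C} \; \mu_1 \|S\|_F^2 + \mu_2 \|L\|_F^2 + 2 \mu_2 \, \mathrm{trace}\left( L^{\top}\big|_{\tilde{\theta}_{r+1}} L \right) + 2 \mu_1 \, \mathrm{trace}\left( S^{\top}\big|_{\tilde{\theta}_{r+1}} S \right) $$

$$ \text{s.t.} \quad A_{t_j} L s_{t_j} = b_{t_j} - c_{t_j} \quad \forall t_j \in \mathcal{I}_r $$

$$ c_{t_j} > 0 \quad \text{and} \quad \|c_{t_j}\|_2 \le c_{\max} \quad \forall t_j \in \mathcal{I}_r $$

$$ L L^{\top} \succeq p I \quad \text{and} \quad L L^{\top} \preceq q I . $$

With this formulation, learning $\mathrm{Proj}_{\Omega_0,
\mathcal{K}}(\tilde{\theta}_{r+1})$ amounts to solving second-order cone
and semi-definite programs.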
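
The slack-variable rewrite of the safety constraint can be illustrated directly. The sketch below computes the slack `c = b - A L s` for one task and checks that it is non-negative and bounded in norm; a negative entry signals that the original inequality is violated. The helper names are hypothetical, and the non-strict test `c >= 0` is an assumption made so that policies on the boundary count as safe.

```python
import math


def slack_for(A, L, s, b):
    """Return c such that A L s = b - c, i.e. c = b - A (L s)."""
    alpha = [sum(l_ij * s_j for l_ij, s_j in zip(row, s)) for row in L]
    lhs = [sum(a_ij * x_j for a_ij, x_j in zip(row, alpha)) for row in A]
    return [b_i - v for b_i, v in zip(b, lhs)]


def slack_is_admissible(c, c_max):
    """Slack must be non-negative element-wise and bounded in Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in c))
    return all(x >= 0.0 for x in c) and norm <= c_max
```

Bounding the slack norm keeps the constrained parameters inside a bounded region, which is what the regret analysis later relies on.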

4.2.1. Semi-Definite Program for Learning L 

This section determines the constrained projection of the shared basis
$L$ given fixed $S$ and $C$. We show that $L$ can be acquired
efficiently, since this step can be relaxed to solving a semi-definite
program in $L L^{\top}$ (Boyd & Vandenberghe, 2004). To formulate the
semi-definite program, note that

$$ \mathrm{trace}\left( L^{\top}\big|_{\tilde{\theta}_{r+1}} L \right) = \sum_{i=1}^{k} l_{r+1}^{(i)\top}\big|_{\tilde{\theta}_{r+1}} l_i \le \sum_{i=1}^{k} \left\| l_{r+1}^{(i)}\big|_{\tilde{\theta}_{r+1}} \right\|_2 \left\| l_i \right\|_2 \le \left\| L\big|_{\tilde{\theta}_{r+1}} \right\|_F \sqrt{\mathrm{trace}\left(L L^{\top}\right)} , $$

where $l_i$ denotes the $i$-th column of $L$. From the constraint set, we
recognize:

$$ s_{t_j}^{\top} L^{\top} L s_{t_j} = a_{t_j}^{\top} a_{t_j} \quad \text{with} \quad a_{t_j} = A_{t_j}^{+} \left( b_{t_j} - c_{t_j} \right) . $$

Algorithm 1 Safe Online Lifelong Policy Search
1: Inputs: Total number of rounds $R$, weighting factor
   $\eta = 1/\sqrt{R}$, regularization parameters $\mu_1$ and $\mu_2$,
   constraints $p$ and $q$, number of latent basis vectors $k$.
2: $S = \mathrm{zeros}(k, |\mathcal{T}|)$, $L = \mathrm{diag}_k(\zeta)$ with $p \le \zeta^2 \le q$
3: for $j = 1$ to $R$ do
4:    $t_j \leftarrow \mathrm{sampleTask}()$, and update $\mathcal{I}_j$
5:    Compute unconstrained solution $\tilde{\theta}_{j+1}$ (Sect. 4.1)
6:    Fix $S$ and $C$, and update $L$ (Sect. 4.2.1)
7:    Use updated $L$ to derive $S$ and $C$ (Sect. 4.2.2)
8: end for
9: Output: Safety-constrained $L$ and $S$


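Since $\mathrm{spectrum}\left(L L^{\top}\right) =
\mathrm{spectrum}\left(L^{\top} L\right)$, we can write:

$$ \min_{X \in \mathcal{S}_{++}} \; \mu_2 \, \mathrm{trace}(X) + 2 \mu_2 \left\| L\big|_{\tilde{\theta}_{r+1}} \right\|_F \sqrt{\mathrm{trace}(X)} $$

$$ \text{s.t.} \quad s_{t_j}^{\top} X s_{t_j} = a_{t_j}^{\top} a_{t_j} \quad \forall t_j \in \mathcal{I}_r $$

$$ X \succeq p I \quad \text{and} \quad X \preceq q I , $$

with $X = L^{\top} L$.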


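4.2.2. Second-Order Cone Program for Learning Task Projections

Having determined $L$, we can acquire $S$ and update $C$ by solving a
second-order cone program (Boyd & Vandenberghe, 2004) of the following
form:

$$ \min_{s_{t_1}, \dots, s_{t_r}, \, c_{t_1}, \dots, c_{t_r}} \; \mu_1 \sum_{j=1}^{r} \left\| s_{t_j} \right\|_2^2 + 2 \mu_1 \sum_{j=1}^{r} s_{t_j}^{\top}\big|_{\tilde{\theta}_{r+1}} s_{t_j} $$

$$ \text{s.t.} \quad A_{t_j} L s_{t_j} = b_{t_j} - c_{t_j} $$

$$ c_{t_j} > 0 , \quad \left\| c_{t_j} \right\|_2^2 \le c_{\max}^2 \quad \forall t_j \in \mathcal{I}_r . $$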


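5. Theoretical Guarantees

This section quantifies the performance of our approach by providing
formal analysis of the regret after $R$ rounds. We show that the safe
lifelong reinforcement learner exhibits sublinear regret in the total
number of rounds. Formally, we prove the following theorem:

Theorem 1 (Sublinear Regret). After $R$ rounds and choosing
$\forall t_j \in \mathcal{I}_R$ $\eta_{t_j} = \eta = \frac{1}{\sqrt{R}}$,
$L\big|_{\hat{\theta}_1} = \mathrm{diag}_k(\zeta)$, with
$p \le \zeta^2 \le q$, and $S\big|_{\hat{\theta}_1} = 0_{k \times
|\mathcal{T}|}$, the safe lifelong reinforcement learner exhibits
sublinear regret of the form:

$$ \sum_{j=1}^{R} l_{t_j}\left(\hat{\theta}_j\right) - l_{t_j}(u) = \mathcal{O}\left(\sqrt{R}\right) \quad \text{for any } u \in \mathcal{K} . $$

Proof Roadmap: The remainder of this section completes our proof of
Theorem 1; further details are given in Appendix B. We assume linear
losses for all tasks in the constrained case in accordance with Sect.
4.2. Although linear losses for policy search RL are too restrictive
given a single operating point, as discussed previously, we remedy this
problem by generalizing to the case of piece-wise linear losses, where
the linearization operating point is a resultant of the optimization
problem. To bound the regret, we need to bound the dual Euclidean norm
(which is the same as the Euclidean norm) of the gradient of the loss
function, then prove Theorem 1 by bounding: (1) task $t_j$’s gradient
loss (Sect. 5.1), and (2) linearized losses with respect to $L$ and $S$
(Sect. 5.2).

5.1. Bounding $t_j$’s Gradient Loss

We start by stating essential lemmas for Theorem 1; due to space
constraints, proofs for all lemmas are available in the supplementary
material. Here, we bound the gradient of a loss function
$l_{t_j}(\theta)$ at round $r$ under Gaussian policies³.

Assumption 1. We assume that the policy for a task $t_j$ is Gaussian, the
action set $\mathcal{U}$ is bounded by $u_{\max}$, and the feature set is
upper-bounded by $\Phi_{\max}$.

Lemma 1. Assume task $t_j$’s policy at round $r$ is given by

$$ \pi_{\alpha_{t_j}}^{(t_j)}\left(u_m^{(k, t_j)} \mid x_m^{(k, t_j)}\right)\Big|_{\hat{\theta}_r} = \mathcal{N}\left( \alpha_{t_j}^{\top}\big|_{\hat{\theta}_r} \Phi\left(x_m^{(k, t_j)}\right), \sigma_{t_j} \right) $$

for states $x_m^{(k, t_j)} \in \mathcal{X}_{t_j}$ and actions
$u_m^{(k, t_j)} \in \mathcal{U}_{t_j}$. For

$$ l_{t_j}\left(\alpha_{t_j}\right) = -\frac{1}{n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \log\left[ \pi_{\alpha_{t_j}}^{(t_j)}\left(u_m^{(k, t_j)} \mid x_m^{(k, t_j)}\right) \right] , $$

the gradient $\nabla_{\alpha_{t_j}} l_{t_j}(\alpha_{t_j})
\big|_{\hat{\theta}_r}$ satisfies

$$ \left\| \nabla_{\alpha_{t_j}} l_{t_j}\left(\alpha_{t_j}\right)\Big|_{\hat{\theta}_r} \right\|_2 \le \frac{M_{t_j}}{\sigma_{t_j}^2} \left[ \left( u_{\max} + \max_{t_k \in \mathcal{I}_{r-1}} \left\{ \left\| A_{t_k}^{+} \right\|_2 \left( \left\| b_{t_k} \right\|_2 + c_{\max} \right) \right\} \Phi_{\max} \right) \Phi_{\max} \right] $$

for all trajectories and all tasks, with
$u_{\max} = \max_{k, m} \left\{ \left| u_m^{(k, t_j)} \right| \right\}$
and $\Phi_{\max} = \max_{k, m} \left\{ \left\| \Phi\left(x_m^{(k,
t_j)}\right) \right\|_2 \right\}$.

5.2. Bounding Linearized Losses

As discussed previously, we linearize the loss of task $t_r$ around the
constraint solution of the previous round $\hat{\theta}_r$. To acquire
the regret bounds in Theorem 1, the next step is to bound the dual norm,
$\left\| f_{t_r}\big|_{\hat{\theta}_r} \right\|_2$, of Eq. (9). It can be
easily seen

$$ \left\| f_{t_r}\big|_{\hat{\theta}_r} \right\|_2 \le \left| l_{t_r}(\theta)\big|_{\hat{\theta}_r} \right| + \underbrace{\left\| \nabla_{\theta} l_{t_r}(\theta)\big|_{\hat{\theta}_r} \right\|_2}_{\text{Lemma 2}} + \left\| \nabla_{\theta} l_{t_r}(\theta)\big|_{\hat{\theta}_r} \right\|_2 \times \underbrace{\left\| \hat{\theta}_r \right\|_2}_{\text{Lemma 3}} . \tag{11} $$

Since $\left| l_{t_r}(\theta)\big|_{\hat{\theta}_r} \right|$ can be
bounded by $\delta_{l_{t_r}}$ (see Sect. 2), the next step is to bound
$\left\| \nabla_{\theta} l_{t_r}(\theta)\big|_{\hat{\theta}_r}
\right\|_2$, and $\left\| \hat{\theta}_r \right\|_2$.

³Derivations for other forms of log-concave policy distributions follow
in a similar manner. In this work, we focus on Gaussian policies since
they cover a broad spectrum of real-world applications.
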

Lemma 2. The norm of the gradient of the loss function evaluated at
$\hat{\theta}_r$ satisfies
2 


(®) 


max 


< 


(9) 


^tk 


\^tk 1 12 ”1” *-max 


q X d 


+ 1 


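To finalize the bound of $\left\| f_{t_r}\big|_{\hat{\theta}_r}
\right\|_2$ as needed for deriving the regret, we must derive an
upper-bound for $\left\| \hat{\theta}_r \right\|_2$:

Lemma 3. The $L_2$ norm of the constraint solution at round $r - 1$,
$\left\| \hat{\theta}_r \right\|_2^2$, is bounded by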


l^rWt q 'X d 


1 -|- \Ir-l I -IS 


max 

tk^Xr—1 


jt 


l^tfe 1 12 t-max) 


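where $|\mathcal{I}_{r-1}|$ is the number of unique tasks observed so
far.

Given the previous two lemmas, we can prove the bound for
$\left\| f_{t_r}\big|_{\hat{\theta}_r} \right\|_2$:

Lemma 4. The $L_2$ norm of the linearizing term of $l_{t_j}(\theta)$
around $\hat{\theta}_r$, $\left\| f_{t_j}\big|_{\hat{\theta}_r}
\right\|_2$, is bounded by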


< 


Vekxe) 


(1+II0.II2) 


< Ji{r) {I + 72{r)) + , 

where Si^ is the constant upper-bound on 

7i(’’) = ^ 


kXe) 


ItM 


( 12 ) 


and 




+ “ax { I I A+J I 2 (I |bt J I2 -b C^ax) } ^ max I ^ max 

tk^Tr-l ^ f 


(ll^tJli + cLj} + \/^ 

72(r) < sjqxd 

“f \/l^i-1 I 1 A ^ 2 , ^ I '^tk (I I I2 

y J) 1 I 2 


5.3. Completing the Proof of Sublinear Regret 

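Given the lemmas in the previous section, we can now derive the sublinear
regret bound given in Theorem 1. Using results developed by
Abbasi-Yadkori et al. (2013), it is easy to see that

$$ \nabla_{\theta} \Omega_0\left(\tilde{\theta}_j\right) - \nabla_{\theta} \Omega_0\left(\tilde{\theta}_{j+1}\right) = \eta_{t_j} f_{t_j}\big|_{\hat{\theta}_j} . $$

From the convexity of the regularizer, we obtain: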

1 




We have: 






^3 ~ ^3 + t 

VI 

C 4 

/t. 

So 


Therefore, for any $u \in \mathcal{K}$:

r r 


i=i 


i=i 


+ f2o(M) — r2o(^i) ■ 


Assuming that $\forall t_j$ $\eta_{t_j} = \eta$, we can derive:




i=i 


i=i 




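The following lemma finalizes the proof of Theorem 1:

Lemma 5. After $R$ rounds with $\forall t_j$ $\eta_{t_j} = \eta =
\frac{1}{\sqrt{R}}$, for any $u \in \mathcal{K}$ we have that
$\sum_{j=1}^{R} l_{t_j}\left(\hat{\theta}_j\right) - l_{t_j}(u) \le
\mathcal{O}\left(\sqrt{R}\right)$.

Proof. From Eq. (12), it follows that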


Or 


<iz{R) + H{R)il{R) 


?-ll 


iz.{R) + ^^il{R)qd\^ + \Rr- 

_^{ll<l|2(||b*J|2+C„ax)"}^ 
with 73 (i?) = 47i(i?) + 2maxt^-gi^_j . Since 


< 


X max 

tk^Xn-i 


< |T|, we have that 


< 75 (i?)|T| with 


75 = S,d/p^q7f{R) ^ max |\\AIJl i\\btj \2 + Cn,ax)H- 

Given that $\Omega_0(u) \le qd + \gamma_5(R) |\mathcal{T}|$, with
$\gamma_5(R)$ being a constant, we have:


/• T 

<vY.^5{R)\r\ 

■ 1 

(qd + 75(i?)|r|-f2o(0l)) . 


i=i 


Initializing L and S: We initialize $L\big|_{\hat{\theta}_1} =
\mathrm{diag}_k(\zeta)$, with $p \le \zeta^2 \le q$ and
$S\big|_{\hat{\theta}_1} = 0_{k \times |\mathcal{T}|}$ to ensure the
invertibility of $L$ and that the constraints are met. This leads to

r r 

(0,) 

i=i i=i 

+ V'? {qd + 75(i?)|T| - ^i 2 kC,) ■ 
Choosing $\forall t_j$ $\eta_{t_j} = \eta = 1/\sqrt{R}$, we acquire
sublinear regret, finalizing the statement of Theorem 1:

r 

+ VR{qd + J5{R)\r\-^^2kC) 

< Vr(j 5 {R)\T\ + qd'y 5 {R)\T\ - fj. 2 kcj 

<o(yi?) . □ 

6. Experimental Validation 

To validate the empirical performance of our method, we applied our safe
online PG algorithm to learn multiple consecutive control tasks on three
dynamical systems (Figure 1). To generate multiple tasks, we varied the
parameterization of each system, yielding a set of control tasks from
each domain with varying dynamics. The optimal control policies for these
systems vary widely with only minor changes in the system parameters,
providing substantial diversity among the tasks within a single domain.



Figure 1. Dynamical systems used in the experiments: a) simple mass
system (left), b) cart-pole (middle), and c) quadrotor unmanned aerial
vehicle (right).


Simple Mass Spring Damper: The simple mass (SM) system is characterized
by three parameters: the spring constant $k$ in N/m, the damping constant
$d$ in Ns/m and the mass $m$ in kg. The system’s state is given by the
position $x$ and velocity $\dot{x}$ of the mass, which varies according
to a linear force $F$. The goal is to train a policy for controlling the
mass in a specific state $g_{\mathrm{ref}} = \langle x_{\mathrm{ref}},
\dot{x}_{\mathrm{ref}} \rangle$.

Cart Pole: The cart-pole (CP) has been used extensively as a benchmark
for evaluating RL methods (Busoniu et al., 2010). CP dynamics are
characterized by the cart’s mass $m_c$ in kg, the pole’s mass $m_p$ in
kg, the pole’s length in meters, and a damping parameter $d$ in Ns/m. The
state is given by the cart’s position $x$ and velocity $\dot{x}$, as well
as the pole’s angle $\theta$ and angular velocity $\dot{\theta}$. The
goal is to train a policy that controls the pole in an upright position.

6.1. Experimental Protocol 

We generated 10 tasks for each domain by varying the system parameters to
ensure a variety of tasks with diverse optimal policies, including those
with highly chaotic dynamics that are difficult to control. We ran each
experiment for a total of $R$ rounds, varying from 150 for the simple
mass to 10,000 for the quadrotor to train $L$ and $S$, as well as for
updating the PG-ELLA and PG models. At each round $j$, the learner
observed a task $t_j$ through 50 trajectories of 150 steps and updated
$L$ and $s_{t_j}$. The dimensionality $k$ of the latent space was chosen
independently for each domain via cross-validation over 3 tasks, and the
learning step size for each task domain was determined by a line search
after gathering 10 trajectories of length 150. We used eNAC, a standard
PG algorithm, as the base learner.

We compared our approach to both standard PG (i.e., eNAC) and PG-ELLA
(Bou Ammar et al., 2014), examining both the constrained and
unconstrained variants of our algorithm. We also varied the number of
iterations in our alternating optimization from 10 to 100 to evaluate the
effect of these inner iterations on the performance, as shown in Figures
2 and 3. For the two MTL algorithms (our approach and PG-ELLA), the
policy parameters for each task $t_j$ were initialized using the learned
basis (i.e., $\alpha_{t_j} = L s_{t_j}$). We configured PG-ELLA as
described by Bou Ammar et al. (2014), ensuring a fair comparison. For the
standard PG learner, we provided additional trajectories in order to
ensure a fair comparison, as described below.

For the experiments with policy constraints, we generated a set of
constraints $(A_t, b_t)$ for each task that restricted the policy
parameters to pre-specified “safe” regions, as shown in Figures 2(c) and
2(d). We also tested different values for the constraints on $L$, varying
$p$ and $q$ between 0.1 and 10; our approach showed robustness against
this broad range, yielding similar average cost performance.

6.2. Results on Benchmark Systems 

Figure 2 reports our results on the benchmark simple mass and cart-pole
systems. Figures 2(a) and 2(b) depict the performance of the learned
policy in a lifelong learning setting over consecutive unconstrained
tasks, averaged over all 10 systems over 100 different initial
conditions. These results demonstrate that our approach is capable of
outperforming both standard PG (which was provided with 50 additional
trajectories each iteration to ensure a fairer comparison) and PG-ELLA,
both in terms of initial performance and learning speed. These figures
also show that the performance of our method increases as it is given
more alternating iterations per round for fitting $L$ and $S$.

We evaluated the ability of these methods to respect safety constraints,
as shown in Figures 2(c) and 2(d). The thicker black lines in each figure
depict the allowable “safe” region of the policy space. To enable online
learning per-task, the same task $t_j$ was observed on each round and the
shared basis $L$ and coefficients $s_{t_j}$ were updated using
alternating optimization. We then plotted the change in the policy
parameter vectors per iteration (i.e., $\alpha_{t_j} = L s_{t_j}$) for
each method, demonstrating that our approach abides by the safety
constraints, while standard PG and PG-ELLA can violate them (since they
only solve an unconstrained optimization problem). In addition, these
figures show that increasing the number of alternating iterations in our
method causes it to take a more direct path to the optimal solution.

[Figure 2 panels: (c) Trajectory Simple Mass, (d) Trajectory Cart Pole.
Legend: Standard PG, PG-ELLA, Safe PG 50 Iterations, Safe PG 100
Iterations.]

Figure 2. Results on benchmark simple mass and cart-pole systems. Figures
(a) and (b) depict performance in lifelong learning scenarios over
consecutive unconstrained tasks, showing that our approach outperforms
standard PG and PG-ELLA. Figures (c) and (d) examine the ability of these
methods to abide by safety constraints on sample constrained tasks,
depicting two dimensions of the policy space ($\alpha_1$ vs. $\alpha_2$)
and demonstrating that our approach abides by the constraints (the dashed
black region).

6.3. Application to Quadrotor Control 

We also applied our approach to the more challenging domain of quadrotor
control. The dynamics of the quadrotor system (Figure 1) are influenced
by inertial constants around $e_{1,B}$, $e_{2,B}$, and $e_{3,B}$, thrust
factors influencing how the rotor’s speed affects the overall variation
of the system’s state, and the lengths of the rods supporting the rotors.
Although the overall state of the system can be described by a
12-dimensional vector, we focus on stability and so consider only six of
these state-variables. The quadrotor system has a high-dimensional action
space, where the goal is to control the four rotational velocities
$\{\omega_i\}_{i=1}^{4}$ of the rotors to stabilize the system. To ensure
realistic dynamics, we used the simulated model described by Bouabdallah
(2007) and Voos & Bou Ammar (2010), which has been verified and used in
the control of physical quadrotors.

We generated 10 different quadrotor systems by varying 
the inertia around the x, y and z-axes. We used a linear 
quadratic regulator, as described by Bouabdallah (2007), 
to initialize the policies in both the learning and testing 
phases. We followed a similar experimental procedure to 
that discussed above to update the models. 

Figure 3. Performance on quadrotor control.

Figure 4. Average number of task observations before acquiring
policy parameters that abide by the constraints, showing that our
approach immediately projects policies to safe regions.

Figure 3 shows the performance of the unconstrained
solution as compared to standard PG and PG-ELLA. Again,
our approach clearly outperforms standard PG and PG-ELLA
in both the initial performance and learning speed. We
also evaluated constrained tasks in a similar manner, again
showing that our approach is capable of respecting
constraints. Since the policy space is higher dimensional, we
cannot visualize it as well as the benchmark systems, and so
instead report the number of iterations it takes our approach
to project the policy into the safe region. Figure 4 shows
that our approach requires only one observation of the task
to acquire safe policies, which is substantially lower than
standard PG or PG-ELLA (which require 545 and 510
observations, respectively, in the quadrotor scenario).
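
As an illustration of what projecting a policy into a safe region can look like, the sketch below performs a Euclidean projection onto a single linear constraint $a^{\mathsf T}\theta \le b$. It is a simplified stand-in, not the authors' projection step: the function name, the single-halfspace constraint, and the numbers are all assumptions made for this example.

```python
def project_onto_halfspace(theta, a, b):
    """Euclidean projection of theta onto the set {x : a . x <= b}."""
    violation = sum(ai * ti for ai, ti in zip(a, theta)) - b
    if violation <= 0.0:
        return list(theta)  # already feasible, nothing to do
    # Move along -a by just enough to land on the constraint boundary.
    scale = violation / sum(ai * ai for ai in a)
    return [ti - scale * ai for ti, ai in zip(theta, a)]


theta = [2.0, 1.0]      # hypothetical unsafe policy parameters
a, b = [1.0, 1.0], 1.0  # hypothetical constraint: theta_1 + theta_2 <= 1
safe = project_onto_halfspace(theta, a, b)
print(safe)
```

A single projection is enough to reach the feasible set in this toy setting; a feasible input is returned unchanged.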

7. Conclusion 

We described the first lifelong PG learner that provides sublinear
regret $\mathcal{O}(\sqrt{R})$ with $R$ total rounds. In addition, our
approach supports safety constraints on the learned policy, 
which are essential for robust learning in real applications. 
Our framework formalizes lifelong learning as online MTL 
with limited resources, and enables safe transfer by sharing 
policy parameters through a latent knowledge base that is 
efficiently updated over time. 

A. Update Equations Derivation

In this appendix, we derive the update equations for L and S in the special case of Gaussian policies. Please note that these 
derivations can be easily extended to other policy forms in higher dimensional action spaces. 


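For a task $t_j$, the policy is given by:

$$\pi_{\theta_{t_j}}^{(t_j)}\left(u_m^{(k,t_j)} \mid x_m^{(k,t_j)}\right) = \frac{1}{\sqrt{2\pi\sigma_{t_j}^2}} \exp\left(-\frac{1}{2\sigma_{t_j}^2}\left(u_m^{(k,t_j)} - (L s_{t_j})^{\mathsf T}\Phi\left(x_m^{(k,t_j)}\right)\right)^2\right)$$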
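
As a small numerical companion to the Gaussian policy above, the following sketch evaluates its log-density for a scalar action, with the mean given by $(L s)^{\mathsf T}\Phi(x)$. The function name and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def gaussian_policy_logpdf(u, phi, L, s, sigma):
    """Log-density of a scalar action u under N((L s)^T phi, sigma^2)."""
    d, k = len(L), len(s)
    # theta = L s: policy parameters rebuilt from the shared basis L and code s.
    theta = [sum(L[i][j] * s[j] for j in range(k)) for i in range(d)]
    mean = sum(t * p for t, p in zip(theta, phi))
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (u - mean) ** 2 / (2.0 * sigma ** 2))


L = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # d = 3 features, k = 2 latent components
s = [0.5, -1.0]
phi = [1.0, 0.5, 2.0]                     # feature vector Phi(x) of some state x
sigma = 0.3

log_p_at_mean = gaussian_policy_logpdf(-1.5, phi, L, s, sigma)  # mean is -1.5 here
log_p_off_mean = gaussian_policy_logpdf(0.0, phi, L, s, sigma)
print(log_p_at_mean, log_p_off_mean)
```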


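Therefore, the safe lifelong reinforcement learning optimization objective can be written as:

$$e_r(L, S) = \sum_{j=1}^{r} \frac{\eta_{t_j}}{2\sigma_{t_j}^2 n_{t_j}} \left[\sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \left(u_m^{(k,t_j)} - (L s_{t_j})^{\mathsf T}\Phi\left(x_m^{(k,t_j)}\right)\right)^2\right] + \mu_1 \|S\|_F^2 + \mu_2 \|L\|_F^2 \qquad (13)$$

To arrive at the update equations, we need to differentiate Eq. (13) with respect to each of $L$ and $S$.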

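A.1. Update Equations for L

Starting with the derivative of $e_r(L, S)$ with respect to the shared repository $L$, we can write:

$$\nabla_L e_r(L, S) = -\sum_{j=1}^{r} \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \left(u_m^{(k,t_j)} - (L s_{t_j})^{\mathsf T}\Phi\left(x_m^{(k,t_j)}\right)\right) \Phi\left(x_m^{(k,t_j)}\right) s_{t_j}^{\mathsf T} + 2\mu_2 L$$

To acquire the minimum, we set the above to zero:

$$\sum_{j=1}^{r} \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \left(\Phi^{\mathsf T}\left(x_m^{(k,t_j)}\right) L s_{t_j}\right) \Phi\left(x_m^{(k,t_j)}\right) s_{t_j}^{\mathsf T} + 2\mu_2 L = \sum_{j=1}^{r} \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} u_m^{(k,t_j)} \Phi\left(x_m^{(k,t_j)}\right) s_{t_j}^{\mathsf T} \qquad (14)$$

To solve Eq. (14), we introduce the standard $\mathrm{vec}(\cdot)$ operator, leading to:

$$\sum_{j=1}^{r} \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \mathrm{vec}\left(\left(\Phi^{\mathsf T}\left(x_m^{(k,t_j)}\right) L s_{t_j}\right) \Phi\left(x_m^{(k,t_j)}\right) s_{t_j}^{\mathsf T}\right) + 2\mu_2 \mathrm{vec}(L) = \sum_{j=1}^{r} \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \mathrm{vec}\left(u_m^{(k,t_j)} \Phi\left(x_m^{(k,t_j)}\right) s_{t_j}^{\mathsf T}\right)$$

Knowing that for a given set of matrices $A$, $B$, and $X$, $\mathrm{vec}(AXB) = (B^{\mathsf T} \otimes A)\,\mathrm{vec}(X)$, we can write

$$\sum_{j=1}^{r} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \mathrm{vec}\left(\Phi\left(x_m^{(k,t_j)}\right) s_{t_j}^{\mathsf T}\right) \left(s_{t_j}^{\mathsf T} \otimes \Phi^{\mathsf T}\left(x_m^{(k,t_j)}\right)\right) \mathrm{vec}(L) + 2\mu_2 \mathrm{vec}(L) = \sum_{j=1}^{r} \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \mathrm{vec}\left(u_m^{(k,t_j)} \Phi\left(x_m^{(k,t_j)}\right) s_{t_j}^{\mathsf T}\right)$$

By choosing

$$Z_L = 2\mu_2 I_{dk \times dk} + \sum_{j=1}^{r} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \mathrm{vec}\left(\Phi\left(x_m^{(k,t_j)}\right) s_{t_j}^{\mathsf T}\right) \left(s_{t_j}^{\mathsf T} \otimes \Phi^{\mathsf T}\left(x_m^{(k,t_j)}\right)\right)$$

and

$$v_L = \sum_{j=1}^{r} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \mathrm{vec}\left(u_m^{(k,t_j)} \Phi\left(x_m^{(k,t_j)}\right) s_{t_j}^{\mathsf T}\right),$$

we can update $\mathrm{vec}(L) = Z_L^{-1} v_L$.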
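
The closed-form update for $L$ can be checked numerically. The sketch below builds $Z_L$ and $v_L$ from random samples, solves for $\mathrm{vec}(L)$, and confirms that the gradient of the regularized squared loss vanishes at the solution. It assumes a single scalar weight per sample standing in for $\eta_{t_j}/(\sigma_{t_j}^2 n_{t_j})$; the helper names, the dimensions, and the random data are hypothetical. It uses the fact that column-major $\mathrm{vec}(\Phi s^{\mathsf T})$ is the transpose of $s^{\mathsf T} \otimes \Phi^{\mathsf T}$.

```python
import random


def solve_linear(A, b):
    """Solve A x = b with Gauss-Jordan elimination and partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(n):
            if row != col:
                factor = M[row][col] / M[col][col]
                M[row] = [x - factor * y for x, y in zip(M[row], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def vec_phi_sT(phi, s):
    """Column-major vec(phi s^T); equals the transpose of (s^T kron phi^T)."""
    return [s_j * phi_i for s_j in s for phi_i in phi]


def update_L(samples, mu2, d, k):
    """samples: list of (weight, u, phi, s). Returns vec(L) = Z_L^{-1} v_L."""
    dim = d * k
    Z = [[2.0 * mu2 if a == b else 0.0 for b in range(dim)] for a in range(dim)]
    v = [0.0] * dim
    for w, u, phi, s in samples:
        g = vec_phi_sT(phi, s)
        for a in range(dim):
            v[a] += w * u * g[a]
            for b in range(dim):
                Z[a][b] += w * g[a] * g[b]
    return solve_linear(Z, v)


def gradient(vec_L, samples, mu2):
    """Gradient of sum_i (w_i / 2)(u_i - phi_i^T L s_i)^2 + mu2 ||L||_F^2."""
    grad = [2.0 * mu2 * x for x in vec_L]
    for w, u, phi, s in samples:
        g = vec_phi_sT(phi, s)
        residual = u - sum(gi * xi for gi, xi in zip(g, vec_L))
        grad = [gr - w * residual * gi for gr, gi in zip(grad, g)]
    return grad


random.seed(0)
d, k, mu2 = 3, 2, 0.1
samples = [(0.5, random.gauss(0, 1),
            [random.gauss(0, 1) for _ in range(d)],
            [random.gauss(0, 1) for _ in range(k)]) for _ in range(20)]
vec_L = update_L(samples, mu2, d, k)
max_grad = max(abs(x) for x in gradient(vec_L, samples, mu2))
print(max_grad)
```

Because $Z_L$ is the sum of $2\mu_2 I$ and positive semi-definite outer products, it is invertible whenever $\mu_2 > 0$, which is why the linear solve is well posed here.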


A.2. Update Equations for S 

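To derive the update equations with respect to $S$, a similar approach to that of $L$ can be followed. The derivative of $e_r(L, S)$
with respect to $S$ can be computed column-wise for all tasks observed so far:

$$\nabla_{s_{t_j}} e_r(L, S) = -\frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \left(u_m^{(k,t_j)} - (L s_{t_j})^{\mathsf T}\Phi\left(x_m^{(k,t_j)}\right)\right) L^{\mathsf T}\Phi\left(x_m^{(k,t_j)}\right) + 2\mu_1 s_{t_j}$$

Using a similar analysis to the previous section, choosing

$$Z_{s_{t_j}} = 2\mu_1 I_{k \times k} + \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} L^{\mathsf T}\Phi\left(x_m^{(k,t_j)}\right)\Phi^{\mathsf T}\left(x_m^{(k,t_j)}\right) L \quad\text{and}\quad v_{s_{t_j}} = \frac{\eta_{t_j}}{\sigma_{t_j}^2 n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} u_m^{(k,t_j)} L^{\mathsf T}\Phi\left(x_m^{(k,t_j)}\right),$$

we can update $s_{t_j} = Z_{s_{t_j}}^{-1} v_{s_{t_j}}$.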


B. Proofs of Theoretical Guarantees 

In this appendix, we prove the claims and lemmas from the main paper, leading to sublinear regret (Theorem 1). 
Lemma 1. Assume the policy for a task $t_j$ at a round $r$ to be given by $\pi^{(t_j)}_{\alpha_{t_j}}$
with $u_{\max} = \max_{k,m}\left\{\left|u_m^{(k)}\right|\right\}$ and $\Phi_{\max} = \max_{k,m}\left\{\left\|\Phi\left(x_m^{(k)}\right)\right\|_2\right\}$ for all trajectories and all tasks.

Proof. The proof of the above lemma will be provided as a collection of claims. We start with the following: 
Proof: Since

Using the Cauchy-Schwarz inequality (Horn & Mathias, 1990), we can upper bound max

Finalizing the statement of the claim, the overall bound on the norm of the gradient of $l_{t_j}(\alpha_{t_j})$ can be written as

Claim: The norm of the gradient of the loss function satisfies:

Combining the above results with those of Eq. (16) we arrive at

The previous result finalizes the statement of the lemma, bounding the gradient of the loss function in terms of the safety
constraints. 
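
To relate $\|L^{\dagger}\|_2$ to $\|L\|_F$, we need to bound $\left\|(L^{\mathsf T}L)^{-1}\right\|$ in terms of $\|L\|_F$. Denoting the spectrum of $L^{\mathsf T}L$ as
$\mathrm{spec}(L^{\mathsf T}L) = \{\lambda_1, \dots, \lambda_k\}$ such that $0 < \lambda_1 \le \dots \le \lambda_k$, then $\mathrm{spec}\left((L^{\mathsf T}L)^{-1}\right) = \{1/\lambda_1, \dots, 1/\lambda_k\}$ such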
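
Proof. We have previously shown that

$$\left\|\hat{f}_{t_j}\big|_{\hat{\theta}_r}\right\|_2 \le \left\|\nabla_{\theta} l_{t_j}(\theta)\big|_{\hat{\theta}_r}\right\|_2 + \left|l_{t_j}(\theta)\big|_{\hat{\theta}_r}\right| + \left\|\nabla_{\theta} l_{t_j}(\theta)\big|_{\hat{\theta}_r}\right\|_2 \times \left\|\hat{\theta}_r\right\|_2 .$$

Using the previously derived lemmas we can upper-bound $\left\|\hat{f}_{t_j}\big|_{\hat{\theta}_r}\right\|_2$ as follows:

$$\left\|\nabla_{\theta} l_{t_j}(\theta)\big|_{\hat{\theta}_r}\right\|_2^2 \le \left\|\nabla_{\alpha_{t_j}} l_{t_j}(\theta)\big|_{\hat{\theta}_r}\right\|_2^2 \left(q \times d \left(\frac{2d}{p^2} \max_{t_k \in \mathcal{I}_{r-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2^2 \left(\|b_{t_k}\|_2 + c_{\max}\right)^2\right\} + 1\right)\right)$$

$$\Longrightarrow \left\|\nabla_{\theta} l_{t_j}(\theta)\big|_{\hat{\theta}_r}\right\|_2 \le \left\|\nabla_{\alpha_{t_j}} l_{t_j}(\theta)\big|_{\hat{\theta}_r}\right\|_2 \left(\frac{d}{p}\sqrt{2q}\sqrt{\max_{t_k \in \mathcal{I}_{r-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2^2 \left(\|b_{t_k}\|_2 + c_{\max}\right)^2\right\}} + \sqrt{qd}\right)$$

$$\le \frac{1}{n_{t_j}\sigma_{t_j}^2} \left[\left(u_{\max} + \max_{t_k \in \mathcal{I}_{r-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2 \left(\|b_{t_k}\|_2 + c_{\max}\right)\right\} \Phi_{\max}\right) \Phi_{\max}\right] \left(\frac{d}{p}\sqrt{2q}\sqrt{\max_{t_k \in \mathcal{I}_{r-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2^2 \left(\|b_{t_k}\|_2 + c_{\max}\right)^2\right\}} + \sqrt{qd}\right). \qquad (20)$$

Further,

$$\left\|\hat{\theta}_r\right\|_2^2 \le q \times d + |\mathcal{I}_{r-1}| \left[1 + \frac{1}{p^2} \max_{t_k \in \mathcal{I}_{r-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2^2 \left(\|b_{t_k}\|_2 + c_{\max}\right)^2\right\}\right]$$

$$\Longrightarrow \left\|\hat{\theta}_r\right\|_2 \le \sqrt{q \times d} + \sqrt{|\mathcal{I}_{r-1}|} \sqrt{1 + \frac{1}{p^2} \max_{t_k \in \mathcal{I}_{r-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2^2 \left(\|b_{t_k}\|_2 + c_{\max}\right)^2\right\}} .$$

Therefore

$$\left\|\hat{f}_{t_j}\big|_{\hat{\theta}_r}\right\|_2 \le \left\|\nabla_{\theta} l_{t_j}(\theta)\big|_{\hat{\theta}_r}\right\|_2 \left(1 + \left\|\hat{\theta}_r\right\|_2\right) + \left|l_{t_j}(\theta)\big|_{\hat{\theta}_r}\right| \le \gamma_1(r)\left(1 + \gamma_2(r)\right) + \delta_{l_{t_j}},$$

with $\delta_{l_{t_j}}$ being the constant upper-bound on $\left|l_{t_j}(\theta)\big|_{\hat{\theta}_r}\right|$, and

$$\gamma_1(r) = \frac{1}{n_{t_j}\sigma_{t_j}^2} \left[\left(u_{\max} + \max_{t_k \in \mathcal{I}_{r-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2 \left(\|b_{t_k}\|_2 + c_{\max}\right)\right\} \Phi_{\max}\right) \Phi_{\max}\right] \left(\frac{d}{p}\sqrt{2q}\sqrt{\max_{t_k \in \mathcal{I}_{r-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2^2 \left(\|b_{t_k}\|_2 + c_{\max}\right)^2\right\}} + \sqrt{qd}\right),$$

$$\gamma_2(r) \le \sqrt{q \times d} + \sqrt{|\mathcal{I}_{r-1}|} \sqrt{1 + \frac{1}{p^2} \max_{t_k \in \mathcal{I}_{r-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2^2 \left(\|b_{t_k}\|_2 + c_{\max}\right)^2\right\}} .$$

□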


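Theorem 1 (Sublinear Regret; restated from the main paper). After $R$ rounds and choosing $\eta_{t_1} = \dots = \eta_{t_j} = \eta = \frac{1}{\sqrt{R}}$,
$L\big|_{\hat{\theta}_1} = \mathrm{diag}_k(\zeta)$, with $\mathrm{diag}_k(\cdot)$ being a diagonal matrix among the $k$ columns of $L$, $p \le \zeta^2 \le q$, and $S\big|_{\hat{\theta}_1} = 0_{k \times |\mathcal{T}|}$, for
any $u \in \mathcal{K}$, our algorithm exhibits a sublinear regret of the form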

Proof. Given the ingredients of the previous section, we next derive the sublinear regret results, which finalize the statement
of the theorem. First, it is easy to see that

$+\ \Omega_0(u) - \Omega_0(\hat{\theta}_1)$.


Finally, for any $u \in \mathcal{K}$, we have:

{ it , { Oj ) - ^ Vt , ( A 

j=i j=i L ^ 

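Assuming $\eta_{t_1} = \dots = \eta_{t_j} = \eta$, we can derive

$$\sum_{j=1}^{r} \left(l_{t_j}(\hat{\theta}_j) - l_{t_j}(u)\right) \le \eta \sum_{j=1}^{r} \left\|\hat{f}_{t_j}\big|_{\hat{\theta}_j}\right\|_2^2 + \frac{1}{\eta}\left(\Omega_0(u) - \Omega_0(\hat{\theta}_1)\right).$$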

The following lemma finalizes the statement of the theorem: 

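Lemma 5. After $R$ rounds and for $\eta_{t_1} = \dots = \eta_{t_j} = \eta = \frac{1}{\sqrt{R}}$, our algorithm exhibits, for any $u \in \mathcal{K}$, a sublinear regret
of the form

$$\sum_{j=1}^{R} l_{t_j}(\hat{\theta}_j) - l_{t_j}(u) \le \mathcal{O}\left(\sqrt{R}\right).$$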


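Proof. It is then easy to see

$$\left\|\hat{f}_{t_j}\big|_{\hat{\theta}_r}\right\|_2^2 \le \gamma_3(R) + 4\gamma_1^2(R)\gamma_2^2(R) \quad\text{with}\quad \gamma_3(R) = 4\gamma_1^2(R) + 2\max_{t_j \in \mathcal{I}_{R-1}} \delta_{l_{t_j}}^2$$

$$\le \gamma_3(R) + 8\frac{d}{p^2}\gamma_1^2(R)\, q d + 8\frac{d}{p^2}\gamma_1^2(R)\, q d\, |\mathcal{I}_{R-1}| \max_{t_k \in \mathcal{I}_{R-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2^2 \left(\|b_{t_k}\|_2 + c_{\max}\right)^2\right\}.$$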
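
As a short check of where the constant $\gamma_3(R)$ comes from, the following worked step squares the bound $\gamma_1(1 + \gamma_2) + \delta_{l_{t_j}}$ on the norm of the linearizing term and applies $(a + b)^2 \le 2a^2 + 2b^2$ twice. It is a sketch added for clarity under the notation of this appendix, not a reproduction of the authors' own derivation.

```latex
\begin{align*}
\left(\gamma_1 (1+\gamma_2) + \delta_{l_{t_j}}\right)^2
  &\le 2\gamma_1^2 (1+\gamma_2)^2 + 2\delta_{l_{t_j}}^2 \\
  &\le 2\gamma_1^2 \left(2 + 2\gamma_2^2\right) + 2\delta_{l_{t_j}}^2 \\
  &= \underbrace{4\gamma_1^2 + 2\delta_{l_{t_j}}^2}_{\text{bounded by } \gamma_3}
     + 4\gamma_1^2\gamma_2^2 .
\end{align*}
```

Taking the maximum of $\delta_{l_{t_j}}^2$ over the observed tasks gives a constant that no longer depends on the particular task $t_j$.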


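Since $|\mathcal{I}_{R-1}| \le |\mathcal{T}|$ with $|\mathcal{T}|$ being the total number of tasks available, then we can write

$$\left\|\hat{f}_{t_j}\big|_{\hat{\theta}_r}\right\|_2^2 \le \gamma_5(R)|\mathcal{T}| ,$$

with $\gamma_5 = 8\frac{d}{p^2}\, q\, \gamma_1^2(R) \max_{t_k \in \mathcal{I}_{R-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2^2 \left(\|b_{t_k}\|_2 + c_{\max}\right)^2\right\}$. Further, it is easy to see that $\Omega_0(u) \le qd + \gamma_5(R)|\mathcal{T}|$
with $\gamma_5(R)$ being a constant, which leads to

$$\sum_{j=1}^{R} \left(l_{t_j}(\hat{\theta}_j) - l_{t_j}(u)\right) \le \eta \sum_{j=1}^{R} \gamma_5(R)|\mathcal{T}| + \frac{1}{\eta}\left(qd + \gamma_5(R)|\mathcal{T}| - \Omega_0(\hat{\theta}_1)\right).$$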


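Initializing L and S: Initializing $L\big|_{\hat{\theta}_1} = \mathrm{diag}_k(\zeta)$, with $p \le \zeta^2 \le q$, and $S\big|_{\hat{\theta}_1} = 0_{k \times |\mathcal{T}|}$ ensures the invertibility of $L$
and that the constraints are met. This leads us to

$$\sum_{j=1}^{R} \left(l_{t_j}(\hat{\theta}_j) - l_{t_j}(u)\right) \le \eta \sum_{j=1}^{R} \gamma_5(R)|\mathcal{T}| + \frac{1}{\eta}\left(qd + \gamma_5(R)|\mathcal{T}| - \mu_2 k \zeta\right).$$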

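Choosing $\eta_{t_1} = \dots = \eta_{t_j} = \eta = \frac{1}{\sqrt{R}}$, we acquire sublinear regret, finalizing the statement of the theorem:

$$\sum_{j=1}^{R} \left(l_{t_j}(\hat{\theta}_j) - l_{t_j}(u)\right) \le \frac{1}{\sqrt{R}}\gamma_5(R)|\mathcal{T}|R + \sqrt{R}\left(qd + \gamma_5(R)|\mathcal{T}| - \mu_2 k \zeta\right)$$

$$= \sqrt{R}\left(2\gamma_5(R)|\mathcal{T}| + qd - \mu_2 k \zeta\right) \le \mathcal{O}\left(\sqrt{R}\right),$$

with $\gamma_5(R)$ being a constant.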
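
To see why the step size $\eta = 1/\sqrt{R}$ yields a sublinear bound, the sketch below evaluates a bound of the generic form $\eta R c_1 + c_2/\eta$ for growing $R$. The constants `c1` and `c2` are hypothetical placeholders for the problem-dependent terms of the proof, not values from the paper.

```python
import math


def regret_bound(R, eta, c1, c2):
    """Generic bound eta * R * c1 + c2 / eta after R rounds."""
    return eta * R * c1 + c2 / eta


c1, c2 = 3.0, 5.0  # hypothetical positive constants
rounds = [10, 100, 1_000, 10_000]
bounds = [regret_bound(R, 1.0 / math.sqrt(R), c1, c2) for R in rounds]
average = [b / R for b, R in zip(bounds, rounds)]

for R, b, a in zip(rounds, bounds, average):
    print(f"R={R:6d}  bound={b:10.2f}  bound/R={a:.4f}")
```

With this choice of $\eta$ both terms scale as $\sqrt{R}$, so the bound equals $(c_1 + c_2)\sqrt{R}$ and the per-round average shrinks toward zero as $R$ grows.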